AD-A151  970 
UNCLASSIFIED 


COMPLEX I TV  OF  DENSE  LINEAR  SVSTEM  SOLUTION  ON  A i/1  ^ 

MULTIPROCESSOR  RING(U)  VALE  UNIV  NEH  HAVEN  CT  DEPT  OF 
COMPUTER  SCIENCE  I  C  IPSEN  ET  AL.  JAN  85 

VALEU/DCS/RR-349  N00014-82-K-0184  F/G  12/1  NL 


MICROCOPY  RESOLUTION  TEST  CHART 

NATIONAL  BURFMJ  OL  RIANDARPR  I%1-* 


REPRODUCED  AT  GOVERNMENT  EXPENSE 


&HHIH 

B*  Q' 


i  his  document  has  been  approved 
tot  public  release  and  sale;  its 
distribution  is  unlimited. 


& ' 


.ciCTE 


sf  APR  2  1985 


YALE  UNIVERSITY  A 

DEPARTMENT  OF  COMPUTER  SCIENCE 


or  03  04  065 


HI 

Abstract.  Different  algorithms,  based  on  Gaussian  elimination,  for  the  solution  of  dense  linear 
systems  of  equations,  are  discussed  for  a  multiprocessor  ring.  The  number  of  processors  is  assumed 
not  to  exceed  the  problem  size.  A  fairly  general  model  for  data  transfer  is  proposed  and  the 
algorithms  are  analysed  with  respect  to  their  requirements  of  arithmetic  as  well  as  communication 
times,  v 

^  >  I 
(}  1 


Complexity  of  Dense  Linear  System 
Solution  on  a  Multiprocessor  Ring 


Ilse  C.F.  Ipsen,  Youcef  Saad  and  Martin  H.  Schultz 

Research  Report  YALEU/DCS/RR-349 
January  1985 


This  work  was  supported  in  part  by  by  ONR  grant  N00014-82-K-0184  and  in  part  by  a  joint  study 
with  IBM/Kingston 


,  v:  1  - 


1.  Introduction 

This  paper  discusses  various  algorithms,  based  on  Gaussian  Elimination,  for  the  solution  of 
dense  linear  systems  of  equations, 

Ax  =■  b, 

on  a  linearly  connected  ring  of  general  purpose  processors. 

In  multiprocessor  systems,  the  total  time  to  perform  a  sequence  of  computational  tasks  does  not 
only  depend  on  when  a  task  is  completed  but  also  where  (i.e.  in  which  processor)  it  is  accomplished. 
This  in  turn  implies  a  great  richness  in  the  class  of  algorithms,  in  terms  of  the  assignments  of  tasks 
to  processors  and  the  assumed  topology  of  the  processor  communication  network.  For  a  particular 
task  it  is  now  important  in  which  processor  its  input  data  are  situated  (ie,  how  long  it  takes  to 
move  them  to  the  requesting  processor)  and  when  they  are  available. 

As  the  number  of  processors  will  not  exceed  the  problem  size,  and  will  usually  be  much  smaller, 
the  algorithms  differ  in  the  way  the  matrix  A  is  distributed  among  the  processors. 

1.1.  Overview 

The  approach  taken  here  for  the  development  and  analysis  of  algorithms  acknowledges  that 
times  for  data  transfer  communication  are  not  negligible  and  may  in  fact  dominate  the  times  for 
actual  arithmetic.  A  fairly  general  communication  model  is  proposed,  and  all  algorithms  are  char¬ 
acterised  and  compared  with  respect  to  their  requirements  for  arithmetic  as  well  as  communication. 

Following  the  classical  approach,  methods  for  triangular  system  solution  are  discussed  (sec¬ 
tion  3)  before  introducing  schemes  for  the  Gaussian  elimination  (section  4).  To  begin  with,  however, 
the  second  part  of  this  section  presents  a  summary  of  the  requisite  hardware  features,  based  on 
which  various  ways  of  transferring  data  can  be  devised  (section  2). 

To  avoid  long,  non-descriptive  formulae  in  favor  of  simpler  results,  the  derivation  of  arithmetic 
and  communication  times  will  contain  merely  high  order  terms  (in  the  problem  size  and  the  proces¬ 
sor  count).  Furthermore,  only  a  few  representative  methods  will  be  described  in  detail  to  illustrate 
their  analysis,  while  other,  obvious  variations,  will  be  listed  in  tables. 

Surveys  of  (general)  parallel  algorithms  for  the  direct  solution  of  dense  linear  systems  of  equa¬ 
tions  appear  in  (3,  9,  10].  Probably  the  earliest  paper  to  realise  that  ‘data  movement,  rather 
than  arithmetic  operations,  can  be  the  limiting  factor  in  the  performance  of  parallel  computers 
on  matrix  operations’  is  [2].  There,  lower  bounds  for  matrix  multiplication  and  matrix  inversion 
are  determined  for  arbitrary  processor  interconnection  schemes  where  each  processor  can  hold  one 
matrix  element.  In  [1],  the  communication  requirements  of  some  numerical  methods,  such  as  tridi¬ 
agonal  system  solution  by  substructuring,  ADI,  FFT  and  fast  Poisson  solvers,  are  analysed  with 
respect  to  shared-memory  multiprocessors  and  highly  parallel  non-shared-memory  MIMD  systems. 
A  probabilistic  model  for  predicting  iteration  time  and  optimal  data  allocation  when  solving  linear 
systems  via  iterative  methods  is  presented  in  [6],  Its  application  in  [7]  prompts  the  conclusion 
that  ‘a  broadcast  bus  architecture  can  effectively  reduce  the  expected  computation  time  for  solving 
sparse  linear  systems.’ 

Gaussian  elimination  for  dense  systems  on  a  multiprocessor  ring  is  discussed  in  [8]  Lawrie  and 
Sameh  [4]  present  a  technique  for  solving  symmetric  positive  definite  banded  systems,  which  is  a 
generalization  of  a  method  for  tridiagonal  system  solution  on  multiprocessors;  it  takes  advantage 
of  different  alignment  networks  for  allocating  data  to  the  memories  of  particular  processors.  How¬ 
ever,  the  analysis  in  both  is  based  on  the  assumption  that  the  time  for  transmitting  one  floating 
point  number  from  a  processor  to  its  nearest  neighbor  does  not  exceed  the  time  for  an  arithmetic 
.  operation. 

This  paper  lays  no  claims  to  being  either  exhaustive  or  complete.  Its  objective  is  to  compare 
a  variety  of  algorithms,  which  are  fairly  reasonable  to  program  and  to  analyse,  for  the  solution 
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Figure  1:  Ring  of  k  s*  8  Processors. 

^  of  a  single  problem  on  a  certain  class  of  parallel  architectures,  thereby  leading  to  a  more  realistic 
approach  to  future  algorithm  development  on  multiprocessor  machines. 

1.2.  The  Multiprocessor  Ring 

The  multiprocessor  architecture  under  consideration,  depicted  in  Figure  1,  was  introduced  in 
[11]  and  consists  of 

•  a  ‘ring’  of  a  small  number  of  k  linearly  connected  general  purpose  (possibly  pipelined)  proces¬ 
sors,  each  with  its  own  memory  (the  processors  in  the  ring  will  be  consecutively  numbered  Pi 
through  P*), 

•  a  fast  bus  used  mainly  for  broadcasting  or  transferring  data  at  high  speed, 

•  local  interconnections  linking  each  processor  to  its  nearest  neighbors  (the  ring) . 

It  is  assumed,  that  any  processor  is  capable  of  writing  to  one  neighbor  while  reading  from 
the  other,  using  the  local  links.  For  purposes  of  estimating  the  computation  time,  processors  are 
considered  to  work  in  lock  step  where  one  step  corresponds  to  the  computation  time  of  the  slowest 
processor. 

In  order  to  discern  the  merits  and  disadvantages  of  the  local  links  on  one  hand  and  the 
broadcast  bus  on  the  other,  they  will  be  used  separately  and  analysed  one  at  a  time. 

Assume  the  bus  has  a  speed  of  Rb  words  per  second  while  the  local  links  can  transfer  data  at 
a  rate  of  Rg  words  per  second.  The  inverses  of  Rg  and  Rg  are  denoted  by  tb  and  Tg  respectively. 
To  be  general,  each  transfer  of  a  data  packet  is  associated  with  a  constant  start-up  (set-up)  time  of 
0g  and  0g,  respectively,  which  is  independent  of  the  size  (the  number  of  words)  per  packet.  Often, 
the  start-up  times  are  (much)  larger  than  the  elemental  transfer  times,  that  is, 

0b*>tb, 
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The  time  to  broadcast  a  packet  of  size  N  via  the  bus  is 

tT,B  =  &B  +  NtB, 

while  the  time  to  send  it  from  a  processor  to  its  neighbor  by  using  the  local  links  is 

tT,L  =  0L  +  Ntl. 

On  a  single  processor,  a  linear  combination  of  two  vectors  of  length  N  takes  time 

tA  -  7  +  Nu, 

where  7  is  the  pipe  fill  time  (it  is  zero  for  non-pipelined  machines),  <*>  the  time  for  one  scalar 
operation  and  7  >  w  (again,  the  start-up  time  dominates  the  elemental  operation  time).  In  this 
paper,  <r,B  denotes  data  transfers  times  for  the  bus,  <r,L  refers  to  those  involving  the  local  links 
and  stands  for  arithmetic  times.  For  any  algorithm  the  sum  of  its  transfer  and  arithmetic  time, 
fr,.  +  tA,  is  simply  called  its  computation  time. 

Whenever  convenient,  we  assume  without  loss  of  generality  that  the  problem  size  N  is  a 
multiple  of  the  number  of  processors,  k. 

2.  Data  Transfers 

In  this  section  we  consider  different  ways  of  transferring  data  among  processors  which  are 
important  in  subsequent  computational  algorithms.  We  assume  that  a  vector  can  be  divided  up 
into  ‘packets’  of  arbitrary  size  (subject  to  the  vector  length,  of  course). 

As  mentioned  in  the  previous  section,  it  takes  time 

It,b  =  0b  +  Ntb 

to  broadcast  a  vector  of  length  N  from  one  processor  to  all  others  using  the  broadcast  bus.  Con¬ 
sequently,  the  time  to  broadcast  a  vector  is  independent  of  the  number  of  destination  processors. 

An  alternative  method  consists  of  using  only  the  local  links  between  the  processors  and  pipelin¬ 
ing  the  data  transfers  :  while  sending  one  packet  to  its  successor  a  processor  receives  the  next  packet 
from  its  predecessor.  Thus,  if  processor  P\  is  to  send  its  data  to  all  other  processors,  then  in  step  1 
the  first  packet  is  sent  from  Pi  to  /V  In  step  j,  the  first  packet  is  sent  from  Pj  to  P*  1  while  the 
second  packet  follows  up  from  Pj- 1  to  Pj,  etc. 

If  the  vector  is  partitioned  into  v  packets  of  equal  size,  then  the  process  will  terminate  after 
k  +  v  -  2  steps,  when  the  last  packet  has  reached  P*.  With  regard  to  high  order  terms  in  k  and  v, 
the  total  time  comes  to 

tr,L  »  (*  +  v)0L  +  (k  +  v)— tl.  (2.1) 

The  above  equation  indicates  that  for  large  enough  v  the  time  required  for  transferring  a  vector  of 
length  JV  is  proportional  to  Nt^.  However,  the  larger  u,  the  larger  the  cost  of  the  set-up  times  will 
be. 

Another  possibility  is  to  have  Pi  send  its  data  ‘both  ways  round’  so  that  processors  to  the  left 
and  right  of  Pi  would  receive  them  at  the  same  time.  The  according  data  transfer  time  comes  to 

*T,L  ( j*  +  v)0l  +  i\k  +  V)~TL' 

which  is  at  most  twice  as  fast  as  the  one  in  (2.1)  when  data  are  sent  to  all  k  processors.  If  fewer 
than  k  processors  are  to  receive  data,  this  scheme  becomes  more  complicated  and  vectors  might 


have  to  be  partitioned  into  packets  of  differing  size.  Since  the  difference  is  only  a  factor  of  two,  we 
prefer  to  restrict  ourselves  to  the  simpler  scheme  of  ‘one  way  data  flow.’ 

From  equation  (2.1)  one  observes  that  an  optimal  value  for  v  exists  and  is  given  by 


"opi  =  (2.2) 


for  which  the  optimal  time  becomes 


tT,l,opt(W)  **  Ntl  +  kpL  +  2s/kNrL/3L 

=  +  \/Wl)7  • 

Observe,  that  in  practice  1  <  v  <  N,  so  that  formula  (2.2)  is  valid  only  when 


(2.3). 


i  <  —PL  <  jv2. 
k  rL  ~ 

Otherwise,  the  optimal  time  simply  becomes 

tT,L,opt(N)  *  (*  +  N){0L  +  rL),  if  J ^  <  1, 

and 

tTj,,opt(N)™k(pL  +  NTL),  if  ~>N2. 

For  example,  assume  that  k  processors  with  transmission  time  tl  and  a  problem  of  size  JV 
are  given.  If,  with  increasing  set-up  time  for  data  transmission,  the  transfer  time  is  to  remain 
optimal,  the  number  of  packets  must  decrease  (while  their  size  increases),  so  that  a  smaller  number 
of  set-up  times  is  required.  Yet,  if  elemental  transmission  and  set-up  time  are  of  the  same  order  of 

magnitude,  then  the  packets  should  each  be  of  size  \f%-  Sometimes  we  will  make  use  of  the  double 
inequality 

Ntl  +  k0L  <  tr,L,opi(N)  <  2 (Ntl  +  k0L).  (2.4) 

Note  that  the  upper  bound  corresponds  to  choosing  the  non-optimal  value  v  =  k. 

Sending  a  vector  to  processors  that  are  not  more  than  a  distance  of  k  <  k  away  changes  the 
optimal  value  of  v  to 

Vopt  =  sj'kNfL 

and  the  corresponding  time  to 


^TyLyOpt 
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(2.5) 


Obviously,  it  does  not  pay  to  divide  a  vector  into  packets  when  using  the  broadcast  bus. 

In  the  case  where  a  vector  v  is  ‘uniformly’  distributed  over  the  k  processors  so  that  the  subvector 
Vi  of  length  ^  resides  in  processor  Pi,  an  important  operation  is  ©tv,  the  direct  sum  of  the  blocks 
vi . . .  vie,  which  makes  the  full  vector  v  available  in  each  processor.  This  obviously  requires  no 
computation  but  only  data  transfers. 
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Using  the  bus,  it  is  possible  to  broadcast  each  v,,  one  after  the  other,  to  all  the  processors 
which  requires  time 

N 

tT,B  -  k&B  +  k(~j^)TB  ~  k0B  +  Ntb- 

When  employing  the  local  links,  the  subvectors  are  ‘rotated’  in  a  roundabout  fashion  :  in 
step  1  we  simultaneously  send  t/i  from  Pi  to  P2,  v2  from  P2  to  P3,  etc,  and  finally  vk  from  P*  to 
Pi;  generally,  in  step  j,  we  transmit  vx  from  Py  to  Py+t,  vj  from  Py+i  to  P;+2,  and  vt  from  Py_i 
to  Pj  (the  indices  should  be  taken  modulo  k).  After  k  of  the  above  steps  v,-  has  encountered  each 
processor.  Hence  the  whole  process  requires  time 

tr,L  =  k0L  +  fc(y)rL  =  k&L  +  Ntl.  (2.6) 

Consequently,  the  number  of  set-up  times  and  elemental  transfer  times  is  the  same  for  bus  and 
local  links. 

In  the  algorithms  for  solution  of  dense  linear  systems,  not  much  difference  will  be  apparent  in 
computation  times  involving  broadcasting  or  pipelined  data  transfer.  Even  though  they  differ  in 
the  number  of  start-up  times,  these  times  (for  broadcasting  as  well  as  local  links)  occur  only  with 
low  order  terms  of  the  problem  size,  N,  and  hence  have  asymptotically  no  influence  on  the  high 
order  terms  which  are  relevant  for  the  overall  time  estimate.  However,  the  final  judgement  can  be 
made  only  when  benchmarks  from  real  multiprocessor  systems  are  available. 

3.  Solution  of  Triangular  Systems 

In  sequential  machines,  a  general  dense  linear  system 

Ax  =  b 

is  efficiently  solved  by  first  reducing  it  to  triangular  form  by  Gaussian  elimination  and  then  solving 
the  resulting  triangular  system.  The  same  approach  will  be  used  for  a  parallel  implementation  on 
the  multiprocessor  ring.  As  is  classically  done,  we  will  start  by  considering  the  solution  of  triangular 
systems. 

3.1.  Partitioning  the  Matrix  into  Blocks  of  Contiguous  Rows 
Consider  the  upper  triangular  system 

Ux  =  b,  (3.1) 

where  U  is  an  upper  triangular  matrix  of  size  N  x  N.  The  simplest  idea  that  comes  to  mind  for 
the  solution  of  such  a  system  on  a  parallel  machine,  is  to  partition  the  rows  of  U  into  k  blocks  each 
consisting  of  N/k  rows  and  to  store  each  block  in  a  processor,  as  shown  in  Figure  2.  Recall  that 
the  processors  are  numbered  consecutively  Pi,P2,..Pfc. 

Let  processor  P  hold  rows  (» - 1)^+  1  to  ij-  of  U ,  the  corresponding  block  t,  of  the  right  hand 
side  vector  b,  and  the  block  x,-  of  the  solution  vector  x.  Accordingly,  denote  by  (7,y  the  N/k  x  N/k 
block  matrix  in  position  (»,j)  of  the  matrix  U.  The  algorithm  TRB  (Triangular  system  solution 
with  Block  Rows)  to  solve  (3.1)  is  shown  below;  proceeding  from  bottom  to  the  top  of  the  matrix 
(i.e.,  »  =  k,k  -  l,...l),  processor  P  solves  the  j-  x  ^  triangular  system  with  the  :th  diagonal 
block  as  coefficient  matrix.  The  solution  vector  x,  is  then  sent  to  the  processors  to  the  left  of  P, 
which  perform  the  corresponding  matrix- vector  multiplications  with  x,. 

Figure  3  graphically  illustrates  the  algorithm,  entries  in  the  matrix  U  denote  the  time  steps 
when  the  corresponding  block  matrix  f7,y  is  processed. 


ALGORITHM  TRB  (Triangular  system  solution  with  Block  Rows) 
1.  Solve  in  P* 

UkkXk  =  bk 


2.  For  i  =  k  —  1,  k  —  2, ...  1  do 

(a)  Send  arl+1  from  P+1  to  Pi,P_!,...Pi 

(b)  For  j  =  1, 2, ...»  do  in  Pj 

bj  :=  6j  —  Py,i+iXi+i 


! 


(3.2) 


(c)  Solve  in  P, 


^iixi  —  6|. 


(3.3) 


At  each  step  »  of  Algorithm  TRB  a  vector  of  length  N/k  must  be  transferred  from  P,+i  to 
Pi,  Pi-i,.  ■ .  Pi  which,  according  to  (2.5),  requires  time 

j  .  2 

~tl  +  ifa  +  2 \J jirL0L  ~  \^J jtl  +  <  2(  +  ifo)  (3.4) 

using  the  local  links  and 

0B  +  (3.5) 

using  the  broadcasting  bus.  Summing  up  over  steps  i  =  k  -  l,k  —  2, . . .  I  yields  the  total  times 
required  for  data  communication 

tT,L  «  Ntl  +  y/?£.  +  ~\j ~TL/3Lk3/2  ■-=  Nrc  +  +  ^k\fNTL/3Lk 

<2  NrL  +  k20L,  (3.6) 

and 

tr.B  %  Ntb  +  k^Bi  (3-7) 

where  the  approximations 

*->  l2  , 

(3.8) 

■«i  i*i 

have  been  employed,  which  are  valid  only  when  k  is  sufficiently  large. 

Now  consider  the  time  spent  doing  arithmetic.  At  each  step  in  (3.2)  a  matrix-vector  multipli¬ 
cation  involving 

N .  N  . 

7h+r') 


(3.9) 


steps  is  performed  in  each  active  processor.  The  cost  of  solving  the  triangular  system  in  (3.3), 


Rx  =  /, 


on  a  pipeline  machine  requires  time 


1 

2 


(3.10) 


because  x,-,  1  <  i  <  y,  is  obtained  by 


(  N/k  \ 

KsH 


and 


xN/k  ~  f N/k/rN/ki 

where  the  indices  are  relative  to  the  sub-block.  The  term  in  brackets  constitutes  an  inner-product 
and  can  be  performed  in  time  (y  -  i  -  l)w  +  If.  The  division  is  incorporated  into  the  pipelining 
of  the  inner-product  so  that  there  is  no  need  for  an  additional  start-up  time.  However,  for  some 
machines  (the  FPS-164,  for  instance)  this  may  have  to  be  revised  as  divsions  are  significantly  more 
expensive  than  additions  or  multiplications.  Summing  (3.9)  and  (3.10)  over  k  -  1  steps  gives  an 
arithmetic  time  of 


3  N2 

l  A  *  2~FW  +  2Nl' 


(3.11) 


3.2.  Partitioning  the  Matrix  into  Blocks  of  Contiguous  Columns 

In  a  block  column-oriented  partitioning  scheme  each  of  the  k  processors  contains  y  adjacent 
columns,  that  is,  processor  P,  contains  columns  (»  —  l)y  +  1  to  of  U,  as  well  as  6,,  where  J7,y 
is  again  the  y  x  y  block  matrix  at  position  (i,j)  of  U.  Although  this  algorithm  will  turn  out  to 
be  the  most  inefficient  one  presented  in  this  paper,  it  is  described  on  account  of  its  simplicity  and 
in  order  to  better  illustrate  related  improved  versions. 

During  each  step,  a  triangular  subsystem  of  order  y  x  y  is  solved,  whereafter  all  matrix  vector 
products  ytj  —  UijXj  for  the  next  higher  block  row  are  performed  in  parallel  and  the  partial  sums  j/,; 
are  sent  to  the  processor  responsible  for  the  subsequent  triangular  system  solution.  Algorithm  TCB 
(Triangular  system  solution  with  Block  Columns)  is  graphically  sketched  in  Figure  4  where  the 
entry  for  U{j  contains  the  time  step  at  which  it  participates  in  a  computation.  The  algorithm  is 
formulated  for  data  transfers  involving  the  local  links,  the  modifications  for  the  bus  are  obvious. 
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Figure  4:  Sketch  of  Algorithm  TCB  for  k  =  8. 

ALGORITHM  TCB  (Triangular  system  solution  with  Block  Columns) 


1.  Solve  in 

Pk 

UkkXk  =  bk 

2.  For  i  = 

k-  1 

,k 

-2,.. 

.1 

do 

(a) 

For  j 

= 

k,  k  - 

1, 

. . . j  +  1  do 

in  Pj 

yij  :=  UijXj 

(b) 

For  j 

= 

fc,  k  — 

1, 

. . .  i  +  1  do 

in  Pj 

Send 

Vij 

to  Pj 

-1 

(c) 

Solve 

in 

Pi 

k 


Uji%i  —  f>i  )  '  yij- 
>»•+! 


(3.12) 


(3.13) 


Observe  that  the  solution  vector  parts  x ,  are  never  transmitted,  only  the  matrix- vector  prod¬ 
ucts  y^.  The  communication  time  with  the  local  links  for  2(b)  in  step  :  comes  to 

(k-i  -  1  )(/?£  + j^) 

since  all  yi;  can  be  sent  at  the  same  time  via  the  local  links.  For  k  —  1  steps  this  makes 

tT,L*\k2pL+\kXTL.  (3.14) 

Because  different  vectors  y,y  must  be  sent  to  one  processor  Pj  at  the  same  time,  broadcasting  bears 
no  advantage  over  the  local  links  and 

tr.B  «  +  ^kNrs- 


»>.,  +  & m'A  ■!  1 
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Figure  5:  Sketch  of  Algorithm  TCBG  for  k  =  8. 


The  arithmetic  time  can  be  easily  determined  by  observing  that  in  each  step  the  matrix-vector 
multiplications  which  are  all  done  in  parallel  are  followed  by  a  triangular  system  solution.  Hence, 
from  (3.9)  and  (3.10)  the  time  per  step  is 


N 


3,1V  „ 


2  k1+  2*  k  * 


and  for  k  steps  it  is  given  by  (3.11).  The  time  for  the  vector  additions  in  2(c)  is 


5*  b  +  T->. 

and  hence, 

1,2,  ,31V2  IV 

tA  -  (2 N  +  -k  )~t  +  (-  —  +  — )w. 

However,  compared  to  method  TRB,  the  communication  time  of  TCB  is  worse  by  a  factor  of 
k.  The  reason  being  that  the  k  -  i  matrix  vector  multiplications  in  (3.12)  of  step  i  are  all  executed 
in  parallel,  and  completed  at  the  same  time.  Consequently,  in  step  2(b)  k  —  i  -  1  vectors  are  sent 
to  one  processor,  which  can  only  receive  them  in  sequence. 

It  would  seem  that  the  computation  time  could  be  improved  by  performing  the  vector  summa¬ 
tions  of  step  2(c)  in  a  ‘logarithmic’  fashion,  since  more  computations  and  transfers  could  be  done 
simultaneously.  In  that  case,  however,  the  distances  to  the  destination  processors  would  increase, 
to  as  much  as  jfc  for  the  last  summation.  The  fastest  way  might  be,  for  a  given  i,  to  compute  the 
first  /,,  f,  <  log(Ar  -  i),  summations  in  logarithmic  fashion  and  then  send  the  partial  sums  and  the 
remaining  log(fc  —  t)  -  /,  vectors  to  one  processor  for  the  computation  of  the  final  y, . 

3.3.  Partitioning  the  Matrix  into  Blocks  of  Contiguous  Columns  :  A  Second  Approach 

Note  that  in  Algorithm  TCB  all  processors  are  waiting  for  the  diagonal  block-system,  which 
is  triangular,  to  be  solved,  so  that  all  matrix  vector  products  of  one  block  row  can  be  computed  in 
parallel.  However,  in  this  second  version  of  the  column  oriented  method.  TCBG  (Greedy  Triangular 
system  solution  with  Block  Columns),  which  might  be  regarded  as  a  “greedy  method',  each  processor 
performs  its  matrix  vector  multiplications  cw  soon  as  possible,  see  Figure  5.  Again,  processor  P, 
comprises  columns  (»  —  l)y  to  of  U. 
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4.6.  Gauss-Jordan  Elimination 

The  Gauss-Jordan  algorithm  is  one  of  the  simplest  approaches  toward  attempting  to  improve 
the  arithmetic  efficiency  of  Gaussian  elimination  on  multprocessors.  We  consider  a  system  par¬ 
titioned  in  k  blocks  of  m  contiguous  rows  as  shown  in  Figure  8.  As  was  observed  earlier,  in  the 
Gaussian  elimination  algorithm  proposed  in  Section  4.1,  we  could  use  the  idle  processors  to  con¬ 
tinue  the  elimination  on  the  rows  above  the  current  pivot  row  thus  maximizing  the  number  of  active 
processors,  at  any  given  step. 

To  estimate  the  transfer  time,  we  observe  that  at  each  step  we  must  send  the  pivot  row  to  all 
processors.  This  results  in  a  time  identical  with  that  of  row  scattered  Gaussian  elimination,  i.e. 
given  by  (4.5).  Similarly,  the  arithmetic  time  is  identical  with  that  of  Gaussian  elimination  with 
block  row  partitioning,  i.e.  it  is  given  by  (4.4). 

The  obvious  advantage  of  this  method  over  that  described  in  Section  4.1,  is  that  we  no  longer 
have  to  solve  a  triangular  system.  Clearly,  scattering  of  the  matrix  across  the  processors  will  not 
result  in  any  gain  here  because  all  processors  are  busy  during  the  whole  elimination  . 

5.  Partial  Pivoting 

So  far,  the  issue  of  pivoting  has  been  put  aside  in  order  to  simplify  the  description  of  our 
algorithms.  In  fact  as  we  now  show,  partial  pivoting  can  be  incorporated,  at  little  extra  cost. 

First  consider  the  block  row  method  described  in  Section  4.1.  At  the  jth  step,  we  need  to 
search  for  the  element  of  largest  absolute  value,  among  the  elements  a,y,i  =  j,..N.  This  can  be 
achieved  in  two  stages.  In  the  first  stage  the  maximum  element  is  found  in  each  processor.  We  refer 
to  these  as  the  local  maxima.  Then  a  comparison  must  be  done  between  these  local  maxima  to 
obtain  the  global  maximum.  The  first  step  costs  where  J  is  the  time  for  doing  one  comparison 
within  any  processor.  For  the  second  stage,  using  nearest  neighbor  connections,  a  round  robin  type 
comparison  between  the  local  maxima  requires  a  communication  time  of  at  most  i(ri  +  0l)  where 
t  is  the  number  of  processors  involved  in  the  j,k  step  of  Gaussian  elimination  ,  i.e.  i  —  k  -  \j/m]. 
The  same  number  of  comparisons  is  also  needed  and  this  requires  time  itj.  A  similar  approach 
using  the  broadcast  bus  is  to  broadcast  each  local  maximum  in  turn  to  the  *  -  1  other  processors 
and  do  the  comparisons  for  the  global  maximum  in  each  of  the  i  processors,  in  parallel.  This  would 
lead  to  similar  times  for  communication  and  for  identical  times  for  arithmetic.  Summing  up  the 
above  over  the  N  -  1  steps  of  the  Gaussian  elimination  algorithm,  one  finds  that  the  overhead  for 
partial  pivoting  in  the  block-row  scheme  is 

tp.BR  *  +  yfc)  w'+  ^y{r  +  0),  (5.1) 

where  r  and  0  represent  either  iq,  and  0i,  or  tb  and  0b  depending  on  which,  the  local  links  or  the 
bus,  is  used. 

For  the  block-column  scheme,  all  of  the  jlh  column  resides  in  one  processor,  so  there  is  no  need 
for  transferring  data.  However,  all  comparisons  are  done  in  one  processor  while  the  others  are  idle. 
It  is  therefore  clear  that  the  resulting  overhead  time  is 

N2  , 

tp,BC  «  ~2~u>-  (5-2) 

The  scattered  scheme  of  Section  3.5  is  similar  to  that  of  the  block  row  scheme,  except  that  the 
comparisons  of  the  second  stage  now  take  place  among  all  k  processors  instead  of  only  i  =  k  -  \jj m] 
as  above.  As  a  result  the  overhead  time  for  pivoting  is  double  that  of  TRB,  i.e. 

tp.BRS  «  i^2k  +  d  +  iVfc(r  +  0).  (5.3) 
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Figure  10:  Block  Diagonal  Scattering  of  a  Linear  System 
Across  Eight  Processors. 

Note  that  the  preceeding  time  is  twice  that  of  most  other  distributions,  because  data  must  be  sent 
in  two  directions,  West-East  and  North-South. 

Using  the  broadcast  bus,  the  only  difference  is  that  each  piece  of  length  m  can  now  be  moved 
simultaneously  to  all  processors  at  the  cost  of  mrg  +  /?g.  The  resulting  total  transfer  time  is 

tr.B  »  iV2rs  +  Nkdg- 

We  consider  now  the  arithmetic  complexity  of  this  method.  In  step  j  of  the  algorithm  we 
perform  4V  -  j  eliminations  each  of  which  is  split  into  several  linear  combinations  of  vectors  of 
length  m  taking  place  in  a  different  processor.  The  processor  most  delayed  in  performing  these 
linear  combinations  is  Pi  which  holds  the  diagonal  blocks.  Processor  Pi  performs  exactly  N  —  j 
linear  combinations  of  length  m  each.  Hence  the  arithmetic  time  at  step  j  is 

{N  -j){mu  +  ~t). 


and  the  total  arithmetic  time  is 

N3  N2 

tA"32kUJ  +  T1'  (4'8) 

We  note  that  it  is  possible  to  extend  this  scheme  by  considering  blocks  of  size  mxm  with 
1  <  m  <  N/k  and  scattering  them  by  diagonals,  i.e.  by  assigning  all  the  blocks  on  the  same 
diagonal  cyclically  to  processors  Pi, Pi,  ..P*,Pi,  P2,  ...P*, ...  The  purpose  of  this  scattering  is  to 
improve  efficiency  by  chosing  the  parameter  m  so  that  the  total  computation  time  is  minimized. 
However,  this  will  not  be  considered  as  it  leads  to  complicated  formulas  and  does  not  result  in  a 
significantly  better  algorithm. 
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Using  the  approximation 


and  summing  over  j ,  we  have 


(IV -i) 


1 JV3  1  jv2 
^3r+2T7' 


Observe  that  as  before  the  start-up  time  is  reduced  by  a  factor  of  k.  Abo  note  that  the  above 
formula  is  only  valid  for  k  <<  JV,  because  approximation  (4.6)  was  used. 

Scattering  the  columns  of  the  matrix  across  the  processors  does  not  change  the  communication 
time  (4.5).  Analogous  to  (4.7),  the  arithmetic  time  'is 

JV3  JV2 

uu+—~1’ 

where  the  start-up  times  are  independent  of  k. 

The  communication  time  for  scattered  partitioning  b  always  larger  than  that  of  the  ‘contiguous’ 
partitioning  of  section  4.1.  For  large  a,  it  b  roughly  twice  as  big.  When  a  b  small,  the  two  times 
are  comparable.  Furthermore,  scattering  makes  the  arithmetic  computation  time  consbtent  with 
the  sequential  case,  ie,  for  Jfc  =  1  the  arithmetic  time  reduces  to  that  of  the  sequential  evaluation. 

4.4.  Reduction  to  a  DDB  scattered  matrix 

As  was  mentioned  in  Sections  3.4  and  3.6,  it  is  important  to  have  diagonal  blocks  that  are 
diagonal  matrices.  Such  matrices  can  be  obtained  in  the  same  time  as  regular  upper  triangular 
systems  by  simply  taking  advantage  of  the  idle  time  of  the  processors  in  the  regular  versions.  The 
result  b  a  triangular  system  whose  solution  requires  less  communication  and  start-up  times,  see 
Section  3. 

4.5.  Partitioning  the  Matrix  into  Block  Diagonals 

Now  consider  a  scheme  which  leads  to  the  diagonal  scattering  of  Section  3.7  by  partitioning 
the  linear  system  into  square  blocks  of  size  m  x  m  each,  where  m  =  N/k.  Again,  A,;  represents  the 
block  matrix  in  position  (i,j)  of  A.  The  data  are  scattered  so  that  block  j  belongs  to  processor 
number  1  +  [(i  —  j)  mod  Jk],  1  <  i,j  <  k.  The  above  distribution  is  illutrated  in  Figure  10,  where  a 
matrix  entry  denotes  the  processor  to  which  the  corresponding  block  matrix  b  assigned.  The  right 
hand  side  6  can  be  considered  as  an  additional  column  of  A  and  b  dbtributed  accordingly  among 
the  processors. 

The  motivation  for  considering  such  dbtributions,  b  the  ease  with  which  local  link-data  trans¬ 
fers  can  be  overlapped,  since  any  two  contiguous  blocks  of  A  belong  to  neighboring  processors. 

At  each  step  j  of  Gaussian  elimination,  the  multipliers  must  be  transferred  downward.  All 
processors  holding  that  row  will  simultaneously  communicate  to  those  beneath  them  one  piece  of 
size  m  =  N/k  of  that  row.  Thb  b  possible  because  of  the  way  the  matrix  b  dbtributed.  Only 
transfers  from  processor  Pi  to  processor  mod  *)  are  necessary.  These  transfers  are  repeated 

until  each  piece  reaches  the  processor  which  holds  the  corresponding  piece  of  the  last  block  row. 
Thb  requires  exactly  [(iV  -  j)/m\  such  transfer  steps.  Each  of  these  steps  consists  of  sending 
in  parallel  from  one  processor  to  a  direct  neighbor  a  vector  of  length  N/k.  Hence  the  time  for 
communicating  the  jlh  row  b  approximately 

— — -{mTL  +  0l)  =  (N  -  j)ri  +  — — -0i. 

m  m 

Once  the  pivot  row  b  available  in  all  processors,  the  pivots  also  need  to  be  transmitted,  i.e.,  the 
jth  column  to  the  right.  Clearly,  thb  involves  the  same  amount  of  time  as  above.  All  this  results 
in  a  total  transfer  time  of 

tr,L  *  N2tl  +  Nk0i. 
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Figure  9:  Gaussian  elimination  with  Scattered  Rows  for  k  =  4. 

4.3.  Scattering  the  Rowe  and  Columns  of  a  Matrix 

As  in  Section  3.5,  the  matrix  A  is  partitioned  into  blocks  of  k  rows  or  columns.  However,  now 
the  processors  do  not  contain  blocks  of  contiguous  rows  or  columns,  but  the  rows  (or  columns) 
are  scattered  ‘cyclically*  across  processors  Pi  ...P*.  This  scheme  will  be  referred  to  as  ‘scattered 
Gaussian  elimination.’ 

The  scattering  of  rows  is  depicted  in  Figure  9.  At  step  j,  the  pivot  row  must  be  available  to 
all  processors  for  purposes  of  elimination  .  Sending  a  row  of  length  N  —  j  to  all  processors  (see 
Section  2)  takes  time 

iN-frL  +  k(lL  +  2y/kfa[y/ini. 

Summing  this  expression  for  j  =  1,2, ...N  -  1  and  using  the  approximations  (3.8),  we  find  an 
approximate  total  communication  time  of 

tr,L  »  +  kN0i  +  2 s/k0LTc^  (iV3/2)  =  ^-rL  ^1  +  +  2a2^  ,  (4.5) 

where  a  is  defined  by  (4.3). 

As  for  arithmetic,  there  are  N  -  j  elements  to  be  eliminated  during  step  j,  and  since  the 
rows  are  scattered  across  the  k  processors,  each  processor  will  perform  about  f(/V  —  j)/k]  linear 
combinations  at  the  cost  of  7  +  (iV  —  j)u>  each.  Therefore,  step  j  consumes  time 

+  (*-;>]. 
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which  can  be  written  as 


(4.2) 


with 


*T,L  »  ~2Tl  (*  +  a)2  ' 


-JIE. 

V  N  rL 


(4.3) 


Using  the  bus  to  broadcast  a  row  of  length  (N  -  j)  to  an  arbitrary  number  of  processors 
requires  the  time 

0B  +  (iV  ~  j)TB 

and  therefore  the  total  communication  time  when  using  the  bus  is  given  by 


<t,b  =  N0b  +  iiV2rs. 


To  determine  the  time  for  arithmetic,  we  note  that  at  step  j  a  processor  performs  at  most  y 
elimination  s  which  takes  time 


Summing  up  over  JV  —  1  steps, 


tA 


N3 

»  — W  + 


(4.4) 


The  reduction  of  the  start-up  time  N2~i  by  the  factor  k  in  the  above  formula,  comes  from  the  fact 
that  the  active  processors  simultaneously  eliminate  (at  most)  ^  elements  per  column  by  computing 
linear  combinations  of  rows  of  length  N  -  j. 


4.2.  Partitioning  of  th«  Matrix  into  Blocks  of  Contiguous  Columns 

Similarly,  if  the  matrix  is  divided  up  into  blocks  of  contiguous  columns,  then  P,  contains 
columns  (i  -  1)  j  to  of  A.  For  simplicity,  the  vector  6  is  considered  to  be  another  column  of 
A  and  hence  stored  in  P*.  At  step  j,  column  j,  which  contains  the  multipliers  is  in  P,,  must  be 
transmitted  to  P,+i  . .  .P*.  This  consumes  roughly  the  same  amount  of  communication  time  as  for 
the  block  row  partitioning,  (4.2)  and  (4.3).  The  arithmetic  operations  during  step  j  take  time 


(n  -  j)b  +  j<*>) 

as  each  processor  forms  N—j  ''near  combinations  of  rows  of  length  y .  The  total  time  for  arithmetic 
operations  comes  to 

1 N3  1  » 

ST “+  2,V  T 

Unlike  (4.4),  the  start-up  time  is  not  reduced  by  k,  since  the  number  of  elements  each  processor 
has  to  eliminate  per  column  is  independent  of  k. 

Recall  that  on  a  sequential  machine  the  time  for  Gaussian  elimination  is  proportional  to  %N3<jJ. 
In  the  preceeding  schemes,  use  of  k  processors  will  not  speed  up  the  computation  by  a  factor  of 
k,  no  matter  how  fast  the  communication,  because  processors  are  often  idle.  There  are  several 
ways  of  improving  the  efficiency  of  these  algorithms.  We  could  keep  processors  busy  by  having  idle 
processors  continue  the  elimination  on  rows  above  the  pivot  row  instead  of  remaining  inactive;  this 
is  the  Gauss-Jordan  method,  to  be  discussed  later.  An  alternative  is  scattering  of  rows  or  columns 
across  processors  as  was  done  already  for  the  solution  of  triangular  systems. 
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Figure  8:  Gaussian  Elimination  on  a  Block  Row  Partitioned 
Matrix. 


4.  Gaussian  elimination 

In  this  section  we  describe  parallel  implementations  of  Gaussian  elimination  on  a  dense  IV  x  N 
matrix  A  for  solving  the  linear  system 

Ax  =  b.  (4.1) 

It  is  assumed  that  no  pivoting  is  required.  The  issue  of  pivoting  will  be  discussed  in  the  next 
section. 


4.1.  Partitioning  the  Matrix  into  Blocks  of  Contiguous  Rows 

The  simplest  way  to  implement  Gaussian  elimination  is  to  subdivide  the  matrix  A  into  k  blocks 
of  y  rows  each  and  assign  one  such  block  to  each  processor,  cf  Section  3.1.  Let  processor  P,  hold 
rows  (»  —  1)^  -+-  1  to  tj-  of  A  and  the  corresponding  components  of  the  right  hand  side  vector  6, 
see  Figure  8. 

If  at  step  j,  row  j  is  stored  in  Pi,  it  must  be  sent  to  Pi+i  ...P*  in  order  to  perform  the 
eliminations  in  each  of  them.  In  section  2  it  was  shown  that  if  only  the  local  links  are  used,  then 
the  transfer  of  a  row,  j,  of  length  N  -  j  to  is  k  -  i  processors  approximately  requires  time 


where  i  depends  on  j 


(\/(iV  -j)tl+  , 


This  yields  an  approximate  communication  time  for  ste{)  j  of 


(iV-» 


:  + 


After  summation  from  j  =  1  to  N  -  1,  this  yields 


Method 

tA 

tr,L 

It,b 

TRB 

+  2N~j 

2  Ntl  +  k20L 

Ntb  +  k0B 

TCB 

(¥  +  f)"  +  (¥  +  *"h 

Ti  +  *t0l 

^j-Tg  +  y@B 

TDB 

+  2N~i 

3  Ntl  +  %0L 

re  +  §0B 

TRB/DDB 

+  2  N't 

2  Ntl  +  k20L 

Ntb  +  k(3g 

TCB/DDB 

(¥+t)«+(¥+«0i 

<frL  +  fife 

TB  +  y@b 

TDB/DDB 

^w  +  2  N'l 

3  Ntl  +  %0L 

+  §gB 

TRS 

(%F  +  +  ($*  +  N)*i 

kNrc  +  kNfii 

Ntb  +  Nfis 

TCS 

(®  +  ^  +  2^)w+(T  +  3iVh 

%tl  +  2  N0L 

(¥•  +  *r)TB  +  (*f  +  N)fo 

TRS/DDB 

AT*.  .  .  JV2 

■JFW+  2PT-7 

Ntl  +  N0l 

Ntb  +  N0b 

TCS/DDB 

(S  +  fV  +  (J  +  Afh 

%tl  +  N0l 

%tb  +  N0b 

TRBG 

+  2Ni 

2  Ntl  +  2k  0l 

kNTB  +  k20B 

TCBG 

2^w  +  21V'y 

2  Ntl  +  2  k0L 

kNTB  +  k20s 

Table  1:  Computation  Times  for  Several  Triangular  System 
Solution  Algorithms. 
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•  When  the  number  of  processors,  Jb,  is  much  smaller  than  N  (with  the  exception  of  the  column 
scattering  scheme,  TCS)  communication  times  are  of  lower  order  than  arithmetic  times. 

•  In  general,  the  scattered  schemes  seem  to  show  better  arithmetic  than  communication  perfor¬ 
mance.  This  is  because  fewer  processors  are  idle  than  in  block  methods,  which  in  turn  results 
in  increased  data  transfers. 

•  Broadcasting,  it  appears,  is  best  done  with  row-oriented  schemes  and/or  ‘greedy’  methods.  For 
a  small  number  of  processors  TRBG  and  TCBG,  and  otherwise  TRS/DDB  are  to  be  preferred. 

•  Communication  on  local  links  is  fastest  with  row-oriented  schemes,  such  as  TRB/DDB  and 
TRB,  as  well  as  with  the  diagonal  block  method  TDB.  For  a  large  number  of  processors, 
k  &  N,  the  row  scattering  schemes  also  fare  well. 

•  The  merit  of  an  algorithm  regarding  arithmetic  depends  very  much  on  the  relation  of  pipe-fill 
times  to  elemental  operation  times  and  the  number  of  processors.  For  example,  for  non- 
pipelined  machines  the  row  scattering  scheme,  TRS/DDB  seems  to  be  most  attractive.  For 
large  pipe-fill  times  and  a  small  number  of  processors  the  block  DDB  methods  look  good. 

•  The  block  schemes  might  be  better  suited  for  pipelined  machines  than  the  scattering  schemes 
since  their  coefficients  for  pipe-fill  times  are  lower. 

•  DDB  matrices  do  not  improve  the  performance  of  greedy  methods,  TRBG  and  TCBG,  since 
simultaneous  matrix-vector  multiplications  conceal  the  improvement  in  triangular  system  so¬ 
lution. 

•  Disappointingly,  the  diagonal  block  scheme,  TDB,  did  not  deliver  the  expected  compromise 
between  block  schemes  (faster  communication)  and  scattering  schemes  (faster  arithmetic). 

To  summarize,  for  fast  solution  of  triangular  systems  on  a  multiprocessor  ring,  row-oriented  methods 
seem  to  be  superior  to  column-oriented  algorithms.  In  particular,  when  communicating  on  local 
links  the  block  row  DDB  method,  TRB/DDB,  and  also  TDB/DDB,  are  recommended;  if  the  number 
of  processors  is  proportional  to  IV,  then  the  row  scattering  algorithms,  TRS  and  TRS/DDB  should 
also  be  considered.  A  row-oriented  scheme,  TRS/DDB,  or  greedy  scheme,  TRBG  or  TCBG,  should 
be  chosen  when  data  transfer  is  done  via  broadcasting. 


using  the  local  links.  However,  this  expression  can  be  simplified  by  using  the  upper  bound 
2(yT£,  +  i@i)  derived  from  (2.4),  which  corresponds  to  splitting  the  data  into  a  non-optimal 
number  of  packets.  Using  the  broadcast  bus,  the  transfer  time  is  just 


*  H 

0B  + 

•  In  step  2(b),  y  matrix-vector  multiplications  of  length  y,  in  time 

N.  N  . 

•  In  step  2(c),  sending  the  result  of  step  2(b)  from  one  processor  to  its  neighbor  can  be  done 
using  the  local  links  simultaneously  for  all  processors  in  time 

.  N 

PL  +  ~£TL, 

but  must  be  sequenced  when  using  the  bus, 

.JV 

10B  + 


•  Solving  the  system  (3.27),  which  requires  time 

JV.  1  iV 

Th  +  2T">- 


Hence,  the  communication  and  arithmetic  times  for  algorithm  TDB  are 


tr,L  **  3i Vrt  -(-  ^k20i 

tT.B  «  \kNrB  +  ~ k2rB 
3  N2 

tA  ^  +  2N~t- 


The  arithmetic  and  local  link  communication  times  are  comparable  to  those  of  the  block  row 
method,  TRB,  while  the  broadcast  transfer  time  is  the  same  as  for  the  block  column  algorithm, 
TCB. 

Analogously  to  the  other  block  schemes,  for  a  DDB  matrix  only  the  arithmetic  time  is  reduced. 


to 


N2 

tA  «  -£-W  +  2JV7. 


3.8.  Summary 

The  complexity  bounds  for  the  various  algorithms  considered  in  this  section  are  summarized 
in  Table  1.  Under  the  assumption  that  high  order  terms  realistically  reflect  the  time  behavior  of 
the  methods,  the  following  conclusions  can  be  drawn. 


Figure  7:  Block  Diagonal  Scattering  of  a  Triangular  System 
for  k  =  8. 


ALGORITHM  TDB  (Triangular  system  solution  with  Diagonal  Blocks) 

1.  Solve  in  P* 

Ukk*k  —  h 

2.  For  i  *  k  -  1,  k  -  2, . . .  1  do 

(a)  Send  x,+l  from  P*  to  P*_i ,  P*_2  . . .  P*-, 

(b)  For  j  =  A:  -  i,  k  -  i  +  1, . . .  k  -  1  do  in  Pj  (y;t  s  6;) 

yj+u  -=  y>,i+i —  Uj,i+\xi+\  (3.26) 

(c)  For  j  =  1, 2, ...»  do  in  Pj 
Send  yj+u  to  P;+1 

(d)  Solve 

IfjjXj  —  3 lk,i-  (3.27) 


At  each  iteration  in  the  above  algorithm  the  following  has  to  be  done 

•  In  step  2(a),  transfering  the  vector  x,+i  from  processor  P*,  where  it  has  just  been  computed, 
to  i  other  processors,  in  time 


Adding  (3.22)  and  (3.21)  results  in  an  arithmetic  time  of 


IN2  IN2 

tA  *  (2 T  +  *)"  +  (j  **  +  *b-  (3-23) 

In  a  comparison  with  methods  TRB,  TCB  and  TCBG  on  the  one  hand,  the  coefficient  for  the 
elemental  operation  time  u>  in  TRS  is  the  smallest;  on  the  other  hand,  the  coefficient  for  the  pipe-fill 
time  7  is  increased  by  j(y)2,  which  makes  the  contribution  of  7  in  TRS  always  larger  than  in  the 
other  schemes. 

3.6.  Scattering  the  Rows  of  a  Diagonal-Diagonal-Block  Matrix 

Scattering  the  rows  of  a  DDB  matrix  decreases  both  the  arithmetic  and  the  communication 
times.  For  each  of  the  ^  steps,  the  time  (3.21)  can  now  be  subtracted  from  the  total  arithmetic 
time,  yielding 

+  (3'24) 

Since  no  triangular  system  has  to  be  solved,  each  processor  contains  exactly  one  element  of  the 
vector  Xi+i  in  (3.18).  Yet,  prior  to  performing  the  operations  in  (3.17),  the  entire  vector  x,+x  must 
be  made  available  in  all  processors.  Thus,  a  direct  sum  of  k  ‘vectors’  of  length  one,  as  described 
in  section  2  ought  to  be  performed.  From  (2.6)  the  transfer  time  for  operations  (3.17)  is  therefore 
k(ji  +  0l)  per  step  on  the  local  links.  For  all  y  steps  this  makes 


tT,L  &  Ntl  +  N (3i . 


(3.25) 


For  the  broadcast  bus,  the  time  is  obviously  similar, 


tr,B  «  Ntb  +  N0b- 


Observe  that  only  here  and  in  algorithm  TRS  with  the  broadcast  bus,  neither  the  transfer 
nor  the  arithmetic  time  increases  with  the  number  of  processors,  k.  Among  the  methods  discussed 
so  far,  the  contribution  of  the  elemental  operation  time  w  in  the  DDB  scattering  scheme  is  the 
smallest  while  that  of  the  pipe- fill  time  7  is  the  largest  when  k  <  \/N. 

3.7.  Partitioning  of  the  Matrix  into  Block  Diagonals 

Scattering  of  block  diagonals  instead  of  rows  or  columns  will  be  considered  in  this  section. 
The  matrix  is  partitioned  into  blocks  of  size  |x  j.  A s  before,  Ufj  represents  the  block  matrix  in 
position  (i,j)  of  U.  The  matrix  is  scattered  so  that  processor  P*-,  contains  block  superdiagonal  i 
of  U,i  =  0,  ..fc  -  1;  in  particular,  the  main  block  diagonal  (superdiagonal  0)  is  contained  in  P*. 

Formally,  P*_,-  contains  block  matrices  Ujj+i ,  for  j  —  1, 2, ..,  k  —  i.  This  scheme  is  described  in 
Figure  7,  where  the  matrix  entries  denote  the  processor  in  which  the  corresponding  block  matrix  is 
contained.  Initially,  corresponding  elements  of  the  right  hand  side  b  and  the  last  column  of  U  reside 
in  the  same  processors.  The  triangular  system  can  then  be  solved  with  algorithm  TDB  (Triangular 
system  solution  with  Diagonal  Blocks). 
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ALGORITHM  TRS  (Triangular  System  Solution  with  Scattered  Rows) 


1.  Solve 


Ummxm  =  bm,  where  m  =  — 


2.  For  i  =  £  -  1,  £  -  2, ...  1  do 

(a)  For  j  ~  1,2, ...i  do 

(b)  Solve 


bj  bj  —  Ujj+iXi+i 


U 1,'^t  —  bj. 


(3.17) 


(3.18) 


There  are  two  main  tasks  in  the  loop  of  the  above  algorithm,  solving  the  k  x  k  triangular 
systems  (3.18)  and  performing  the  matrix-vector  multiplications  (3.17). 

Note  that  during  the  triangular  system  solution  each  row  of  the  matrix  (7„  system  is  contained 
in  a  different  processor.  Therefore  one  can  use  the  results  of  Section  3.1  with  N  —  k,  i.e.  j-  =  1. 
Observe,  that  it  is  most  efficient  to  send  each  newly  found  element  of  the  solution  vector  directly 
to  all  other  processors.  However,  since  only  one  scalar  at  a  time  is  transmitted,  the  data  transfer 
time  comes  to  k(ri,  +  0i)  for  the  local  links;  for  each  triangular  system  the  amount  of  transfers 
comes  to 

k2(TL  +  0i). 

Since  broadcasting  does  not  depend  on  the  number  of  processors  to  be  addressed,  the  solution  of 
(3.18)  requires  a  number  of  data  movements  proportional  to 

k(TB  +  0b)- 

For  j-  linear  systems  the  communication  cost  is  thus 

tr,L  »  Nkfa  +  0i)  (3.19) 


N{tb  +  Pb). 


(3.20) 


Once  the  system  (3.18)  is  solved,  each  processor  contains  all  known  elements  of  the  solution  vec¬ 
tor.  In  contrast  to  the  previous  schemes  TRB,  TCB  and  TCBG,  which  partition  the  matrix  into 
contiguous  parts,  the  coefficient  of  the  start-up  times  0  in  the  scattering  scheme  grows  with  the 
problem  size  (for  broadcast  bus  as  well  as  local  links). 

The  arithmetic  time  for  the  solution  of  a  k  x  k  triangular  system  is  k(* 7  +  w),  resulting  in  a 
total  of  approximately 

2V(7  +  w).  (3.21) 

Once  the  vector  ar,+i  is  available  in  all  processors,  each  processor  will  subtract  from  its  com¬ 
ponent  of  bj  the  inner-product  of  its  row  of  U  with  ar,+i  in  time  7  +  kur,  j  —  1,2,...  i.  Hence  the 
arithmetic  time  in  step  i  of  algorithm  TRS  is  i(i  +  kw)  and  the  total  over  all  steps  in  (3.17)  comes 


+  ku). 


(3.22) 
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Figure  6:  Scattering  the  Rows  of  a  Triangular  System  for  k  =  4. 


Note,  that  the  computation  time  is  not  diminished  when  TCBG  is  applied  to  a  DDB  matrix, 
because  the  simultaneous  matrix-vector  multiplications  conceal  the  improvement  in  the  triangular 
system  solution. 


3.5.  Scattering  the  Rows  of  the  Matrix 

The  previous  algorithm  is  arithmetically  not  efficient  since  many  processors  are  idle  during 
an  important  part  of  the  process.  A  remedy  is  to  simply  scatter  the  rows  of  the  matrix  U  across 
the  processors  in  a  cyclic  way  so  that  the  work  is  divided  more  evenly  and  processors  become 
idle  only  during  the  last  k  steps  (this  is  termed  ‘torus  wrap’  in  [5]).  Clearly,  one  can  expect  the 
communication  time  to  increase  somewhat,  while  the  arithmetic  time  should  decrease. 

Let  the  rows  of  U  be  scattered  in  such  a  way  that  Pi  contains  rows  t,  i+k,  i+2fc, . . .  i+ ( ^  -  l)fc. 
This  time  the  matrix  U  is  divided  up  into  blocks  of  size  k  each  (separated  by  bold  lines  in  Figure  6) 
instead  of  ^  as  in  Section  3.1.  Let  b\ ,  bj, . . . ,  bpr/k  and  x\ ,  xj, . . . ,  xs/k  be  the  blocks  of  the  right  hand 
side  b  and  the  solution  x,  respectively,  corresponding  to  the  above  partitioning.  After  all  processors 
have  participated  in  the  triangular  system  solution,  they  perform  matrix-vector  multiplications 
with  the  newly  found  part  of  the  solution  vector  which  each  of  them  contains.  Algorithm  TRB  is 
modified  as  follows. 
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Proceeding  from  bottom  to  top  of  the  matrix, »  =  k,k  —  1, . . .  1,  processor  Pi  solves  the  x  x  T 
triangular  system  associated  with  the  ith  diagonal  block  and  then,  successively,  performs  a  matrix- 
vector  multiplication  with  *,•  and  each  matrix  block  ‘higher  up’  in  the  column.  In  the  mean  time, 
the  neighboring  processor  to  the  left  can  start  solving  its  triangular  system,  followed  by  matrix- 
vector  multiplications  of  all  matrix  blocks  in  the  corresponding  column  with  the  solution  vector 
part. 

All  steps  of  Algorithm  TCBG  consist  of  a  triangular  system  solution  followed  by  matrix-vector 
multiplications  and  additions,  after  each  of  which  at  most  k/ 2  processors  simultaneously  transfer 
a  vector  of  length  j-  via  the  local  links  to  their  left  neighbor  in  time 

JV 

2  jtl  +  2  0L. 

Since  this  communication  takes  place  only  between  pairs  of  processors,  the  vector  is  not  divided 
into  packets  of  smaller  size  and  the  number  of  start-up  times  is  reduced  by  more  than  a  factor  of 
k  compared  to  TRB  and  TCB.  After  k  -  1  steps  the  time  for  data  communication  is  about 

tr,L  «  2 Ntl  +  2 k0L. 

In  case  of  broadcasting  no  simultaneous  transfer  is  possible  anymore  and  the  upper  bound  per  step 
increases  to 

4(T  TB  +  0B)  =  NTB  +  k0B, 
bringing  the  total  time  for  data  exchange  to 

tT,B  »  kNrB  +  k20B, 

which,  as  expected,  exceeds  by  a  factor  of  k  the  communication  time  involving  local  links. 

The  communication  time  in  TCBG  with  local  links  is  superior  by  a  factor  of  k  to  the  one  in 
TCB.  If  it  comes  to  broadcasting,  TRB  is  the  preferred  scheme.  Hence,  the  row-oriented  scheme 
would  benefit  from  broadcasting  while  the  column-oriented  scheme  would  be  better  off  with  data 
exchange  on  the  local  links. 

For  the  arithmetic  operation  time,  note  that  in  contrast  to  TCB,  matrix-vector  multiplications 
and  linear  system  solutions  are  overlapped.  As  the  processors  are  assumed  to  work  in  lockstep  and 
a  matrix-vector  multiplication  needs  twice  the  amount  of  time  of  a  triangular  system  solution  of 
the  same  size,  the  arithmetic  operation  count  is  proportional  to 

jy2 

tA  &  2—u +  2N~i,  (3.15), 

which  is  larger  than  the  one  for  TRB  or  TCB.  Moreover,  at  any  given  time  never  more  than  half 
of  the  processors  are  active,  cf  Figure  5. 

3.4.  Partitioning  a  Diagonal- Diagonal- Block  Matrix  into  Contiguous  Blocks  of  Rows  or  Columns 
The  solution  of  the  triangular  systems  in  TRB  and  TCB  can  be  avoided  and  the  arithmetic 
time  reduced  by  about  50%  when  the  diagonal  blocks  of  U  are  ^  x  £  diagonal  matrices.  Such 
matrices  are  referred  to  as  diagonal-diagonal-block  (DDB)  matrices.  The  communication  times 
remain  the  same  while  the  arithmetic  time  for  both  TRB  and  TCB  is  reduced  to 
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Note  that  for  the  row-oriented  schemes,  as  is  the  case  on  a  sequential  machine,  there  is  no 
need  to  actually  permute  the  pivot  row  with  the  jth  row,  i.e.  to  physically  move  the  two  rows  to 
new  locations  among  the  processors.  All  that  is  needed  is  to  associate  with  each  row  an  integer, 
indicating  the  actual  position  of  this  row  with  respect  to  the  resulting  upper  triangular  system. 
The  result  will  be  an  upper  triangular  system  whose  rows  are  scattered  in  a  random  way  accross 
the  processors,  each  processor  holding  N/k  rows.  However,  this  will  not  only  complicate  the  orga¬ 
nization  of  the  solution  of  the  resulting  triangular  system  but  will  also  increase  the  communication 
time. 

To  be  concrete  let  us  outline  an  analogue  of  the  TRB  algorithm  of  Section  3.1  for  solving 
an  upper  triangular  system  whose  rows  are  randomly  scattered  as  a  result  of  the  above  pivoting 
technique.  The  first  step  consists  in  computing  the  last  component  £,v  of  the  solution  x  (cost 
=  7  +  w),  in  the  processor  holding  the  last  row  (therefore  each  processor  must  check  whether  it 
contains  the  last  row  but  we  neglect  the  cost  for  these  tests).  Then  the  component  is  sent  to 
all  processors  (cost  =  Ar(/?r,  +  r i)  ;  (0g  -f  rg)),  which  will  then  perform  the  analogue  of  step  2.(b) 
of  Algorithm  TRB,  i.e.  they  will  subtract  from  the  right  hand  side  their  part  of  the  last  column 
of  U  times  the  last  component  £,v  just  received.  Since  each  processor  holds  at  most  N/k  elements 
of  any  column  of  U,  this  will  take  at  most  jy  +  7.  Next  f.v-i  is  computed  in  some  processor 
and  the  above  is  repeated.  Because  of  the  random  scattering  of  the  rows,  each  component  must 
be  sent  to  all  processors  at  any  step.  Also,  since  the  number  of  rows  contained  in  each  processor 
is  only  known  to  be  at  most  N/k  at  any  given  step,  the  arithmetic  time  required  for  each  step  is 
upper  bounded  by  fw  +  7.  If  we  sum  up  over  N  steps,  the  total  time  for  solving  such  a  randomly 
scattered  triangular  system  comes  to 

tT,L  «  kN{TL  +  0L) 


for  communication  and 


tr.B  **  N(rg  +  0g) 


<  (-£-  +  N)u  +  2Nl, 


for  arithmetic.  Observe  the  increase  of  the  contribution  from  latencies  in  the  communication  time 
when  the  local  links  are  used. 

The  complexity  of  the  algorithm  itself  is  a  non-negligible  difficulty.  However,  the  column 
schemes  do  not  present  these  drawbacks  since  the  rows  can  be  permuted  without  any  data  move¬ 
ment.  The  Gauss-Jordan  algorithm  will  not  lead  to  the  above  difficulties  as  the  resulting  system 
is  diagonal,  but  it  will  remain  potentially  unstable.  In  view  of  the  fact  that  the  above  costs  for 
solving  triangular  systems  are  small  in  comparison  with  those  of  Gaussian  elimination,  it  appears 
that  on  the  whole  the  column  scheme  is  by  and  large  more  attractive  when  pivoting  is  necessary. 


8.  Discussion 

The  performances  of  the  Gaussian  elimination  algorithms  considered  in  Section  4  are  summa¬ 
rized  in  Table  2.  The  following  observations  can  be  made  by  examining  the  results  of  Sections  3, 
4,  and  5. 

•  The  communication  times  are  low  order  terms  as  compared  with  arithmetic  times  when  k  << 
N. 

•  Both  the  arithmetic  times  and  the  communication  times  of  those  of  the  triangular  system 
solution  algorithms  are  low  order  terms  in  comparison  with  those  of  Gaussian  elimination. 
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Method 

tA 

lT,L 

tT,B 

GE/BR 

Ti(l  +  a )2 

*TTB  +  N0B 

GE/BC 

^L(l  +  a)2 

%tb  +  N0b 

GE/RS 

TF'I 

N'2ri(l  +  |a  +  2o2) 

%Tb  +  S0B 

GE/CS 

N2tl(  1  +  |a  +  2a2) 

*F TB  +  X0B 

GE/DDB 

JVS,  ,  .  JV* 
3£uJ+  TFT 

N2tl(  1  +  §a  +  2a2) 

^tb  +  N0B 

GE/DS 

fw+f-, 

N2tl  +  Nk0i 

^tb  +  kN0B 

GJ 

+  N2-, 

N2tl(  1  +  fa  +  2a2) 

STTB  +  N0b 

Notes: 

1)  a=  \J& 

2)  We  have  the  following  upper  bounds: 

^rt(l  +  a)2  <  N2rL  +  kN0L. 

N2tl{1  +  fa  +  2a2)  <  N2tl  +  2 kN0L. 

Table  2:  Timings  for  several  Gaussian  elimination  algorithms 

•  The  scattered  schemes  have  better  arithmetic  performances  but  worse  communication  perfor¬ 
mances. 

•  The  communication  times  are  better  with  the  bus  than  with  local  links,  which  is  to  be  expected 
since  pivot  rows  must  be  broadcast  to  several  processors  at  any  given  step. 

•  The  diagonal  scattering  of  data  results  in  poor  overall  performance. 

•  The  overhead  of  pivoting  is  small  as  compared  with  the  cost  of  Gaussian  elimination.  It  is 
however  of  the  sarnie  order  of  magnitude  as  that  of  triangular  system  solution. 

•  Pivoting  is  less  expensive  for  the  row-oriented  schemes. 
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